3  Import

3.1 Overview

This section loads the dataset, sourced from the Badminton World Federation (BWF) website and curated by Andrew Zhuang, into memory in a clean, consistent format. This prepares it for downstream analysis, including feature engineering, modelling, and visualization.

The raw dataset is a CSV file containing match-level records from 2018 to 2025. It includes fields such as event type, players, match outcomes, and tournament metadata. Since scraped data can be messy, we clean it immediately on import: we restrict the records to singles events, parse the match dates, and sort the matches chronologically.
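The kind of import-time check described above can be sketched as follows. This is a minimal illustration, not the project's own validation code: the helper `check_raw_frame` and the constant `REQUIRED_COLUMNS` are hypothetical names, and the only column names assumed are `event` and `date`, which the loading code in this section uses.

```python
import pandas as pd

# Assumption: "event" and "date" are the only column names taken from this section.
REQUIRED_COLUMNS = ("event", "date")


def check_raw_frame(df, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems found in a raw match table."""
    problems = []

    # Stop early if a required column is absent; later checks would fail anyway.
    missing = [col for col in required if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems

    # errors="coerce" turns unparseable dates into NaT so they can be counted.
    n_bad_dates = int(pd.to_datetime(df["date"], errors="coerce").isna().sum())
    if n_bad_dates:
        problems.append(f"{n_bad_dates} rows with unparseable or missing dates")

    n_null_event = int(df["event"].isna().sum())
    if n_null_event:
        problems.append(f"{n_null_event} rows with missing event")

    return problems


# Toy frame with one bad date and one missing event (illustrative values only).
toy = pd.DataFrame(
    {
        "event": ["MS", None, "WD"],
        "date": ["2019-03-01", "not a date", "2020-01-15"],
    }
)
issues = check_raw_frame(toy)
```

Returning a list of problems rather than raising on the first one makes it easier to see every issue in a scraped file at once.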


3.2 Load the Data

We begin by defining paths and reading the CSV into a pandas DataFrame. Restricting to singles events (MS for Men’s Singles, WS for Women’s Singles) avoids mixing in doubles matches, which require different modelling due to partner synergy.
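To make the singles filter concrete, here is a small sketch on a toy frame. The event labels `MD`, `WD`, and `XD` for doubles are assumed for illustration and are not confirmed by the dataset. The pattern `MS|WS` is a substring match, so any label that contains `MS` or `WS` is kept, and `na=False` treats a missing event as a non-match.

```python
import pandas as pd

# Toy event column; the doubles labels and the missing value are assumptions.
toy = pd.DataFrame({"event": ["MS", "WS", "MD", "WD", "XD", None]})

# Substring match on "MS" or "WS"; na=False maps missing events to False
# so the mask is strictly boolean and can be used for indexing.
mask = toy["event"].str.contains("MS|WS", regex=True, na=False)
singles = toy[mask].copy()

print(singles["event"].tolist())
```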

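The script first imports the required libraries and sets its configuration. Model artifacts are written to the project directory and, when PIN_TO_S3 is enabled, pinned to the S3 bucket badminton12345, where the training dataset is also stored in CSV format.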

# train_model.py
import os
import json
import inspect
from pathlib import Path
from collections import defaultdict

import numpy as np
import pandas as pd
import networkx as nx
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, brier_score_loss
from xgboost import XGBClassifier
import joblib

# ---------- config ----------
BASE_DIR  = Path("/Users/yifanw124/STAT468/stat468-final-project")
DATA_PATH = BASE_DIR / "tournaments_2018_2025_June.csv"
OUT_DIR   = BASE_DIR
OUT_MODEL = OUT_DIR / "stack_model.joblib"
OUT_META  = OUT_DIR / "feature_spec.json"

PIN_TO_S3          = os.getenv("PIN_TO_S3", "false").lower() == "true"
USE_VETIVER_BUNDLE = os.getenv("USE_VETIVER", "false").lower() == "true"
RANDOM_STATE       = 42

MODEL_BUCKET = os.getenv("MODEL_BUCKET", "")          # used only if PIN_TO_S3
MODEL_PIN    = os.getenv("MODEL_PIN", "stack_model")  # also used as vetiver model_name
# ---------- load ----------
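df0 = pd.read_csv(DATA_PATH)
# Keep singles only; na=False treats a missing event as a non-match instead of failing on a NaN mask.
df0 = df0[df0["event"].str.contains("MS|WS", regex=True, na=False)].copy()
# Parse dates and sort chronologically.
df0["date"] = pd.to_datetime(df0["date"])
df0 = df0.sort_values("date").reset_index(drop=True)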
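One way to check the load step without touching the real file is to wrap it in a function and run it on a small in-memory CSV. The sketch below assumes the same `event` and `date` columns as above; the function name `load_singles`, the `winner` column, and the sample rows are hypothetical and exist only for illustration.

```python
import io

import pandas as pd


def load_singles(source):
    """Read a match CSV, keep singles events, parse dates, sort by date."""
    df = pd.read_csv(source)
    df = df[df["event"].str.contains("MS|WS", regex=True, na=False)].copy()
    df["date"] = pd.to_datetime(df["date"])
    return df.sort_values("date").reset_index(drop=True)


# Hypothetical sample rows, deliberately out of date order and with one doubles match.
csv_text = """event,date,winner
WS,2021-05-02,player_c
MD,2019-01-10,pair_x
MS,2018-03-14,player_a
"""

sample = load_singles(io.StringIO(csv_text))
print(sample)
```

Because `pd.read_csv` accepts both a path and a file-like object, the same function could be called with `DATA_PATH` on the real dataset.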